Preamble

Why we’re doing this

We are doing this because we want to extract insight from data. We want to be able to learn something from the information we have.

This workshop—and my own learning in R—uses a number of key ideas from Hadley Wickham’s R for Data Science.1 It is a wonderfully written book with a beginner audience in mind, and goes into far greater depth than this workshop can. However, with this workshop behind you I hope you’ll be able to jump into R for Data Science with confidence. From its introduction is this (adapted) graphic:

It lays out the project of any data science work:

  • Find and read data into R.
  • Explore data
    • Transform data: add things you want, and remove things you don’t.
    • Visualise data: see what’s happening in your data.
    • Analyse data: perform statistical analysis on your data to draw conclusions.
  • Communicate your findings.

And, importantly, we’re here to have fun. To make fun and exciting visuals like

[EXAMPLE FROM BBC? or elsewhere that have used R]

Using a script-based program keeps a record of everything you do

R is a script-based language. You write down a list of instructions and it will follow, performing one action after another. This is different to ‘point and click’ software like Microsoft Excel, and it can feel a bit cumbersome.

In Excel, you can perform a series of steps:

  • Open a file.
  • Delete column that you don’t need.
  • Select all of your data and sort from highest to lowest.
  • Delete all the rows that have missing values.
  • Save your file.

But, there is no record of any of this. You might look back at the data in a week’s time and not know what rows you have deleted. This is particularly important when you go to write up your methodology for an assignment or thesis. You might not be able to recreate your own work.

R is a script-based program. You write a list of instructions and R will follow it. This is wonderfully handy:

  • everything you have told it to do, it will do;
  • you have recorded everything you have told it to do.

Using a free, open-sourced program is beneficial

There are many proprietary script-based programs for analysing data: Stata, SAS, EViews, SPSS, MATLAB. I imagine you will encounter a few of these during your undergrad.

Proprietary software costs money

They cost $$$. If you are not at a university/workplace that has the program, you can’t use it. Even if you are at a place that has a license, you might have to use a special computer lab to use the program. Or you might have to pay for a license yourself.

Importantly, the cost means you are less likely to play around with the program in your spare time or for non-university activities. Free tools show the opposite effect. Google Sheets is a good example: people make budgets or plans in Sheets and, even if those projects are simple, they’re learning as they go.

Proprietary software is centrally controlled

Proprietary programs are also centrally controlled. Their functions are written by the company, and you can only use the set of functions they provide.2

R is free

R is free. You can use it from any computer at any time. The analysis you do with R in your undergrad will be reusable when you graduate. It also means that you will be able to display the code of your analysis in a portfolio of work when you’re looking for jobs in the future.

The free-ness of R means more people use it. R is known for a thriving community, meaning you can quickly search for how to DO THE THING I WANT TO DO in R and find an answer from wonderful people on the internet. People on the R internet are wonderful, and you’ll soon feel like this:

‘Me “working independently”’ by Allison Horst, @allison_horst

R is open-source

R also thrives on user-written packages (collections of functions) that are available to everyone, for free. A literature review by Robert Muenchen for r4stats.com noted:3

In 2015, R added 1,357 packages…or approximately 27,642 functions. During 2015 alone, R added more functions than SAS Institute has written in its entire history.

Digression: ideally, as more and more faculties in universities around Australia make R their program-of-choice, you’ll be able to spend more time mastering a single program (and properly understanding what you are doing in that program), and less time fumbling around with a new language for each subject you take.

The problems with open-source software

The main downside to free, open-source software is that there are many ways to do the same thing. If we had a dataset that looked at country population by year, and we only wanted to keep the year 2007, we could do it in five ways (or more!):4

  • Subset using brackets by extracting the rows where year is equal to 2007
  • Subset using brackets by omitting the rows where year is not equal to 2007
  • Subset using brackets in combination with the which() function and the %in% operator
  • Subset using the subset() function
  • Subset using filter() function from the dplyr package

This makes searching for answers a little bit more difficult. However, this workshop uses tidyverse syntax (which includes filter from the dplyr package in the last example above), so I advise phrasing your Google search as how to DO THE THING I WANT TO DO r tidyverse. This will remove at least a bit of the confusion.

Getting started in R

Now that you are completely convinced that R is the best, let’s get into it.

How to install R and R Studio

You will need to download and install R and R Studio. (And you can skip this subsection if you already have!).

R is the language and the program. Think of it as the engine that powers the things you do. You can download it for:

Once downloaded, follow the prompts to install. Restart your computer if required.

R Studio is the interface you will use R with. The technical term is an ‘integrated development environment’ (IDE) for R. Think of it as the dashboard that shows you all the things you’ve got going on in R. You can download it for:

Then, follow the prompts to install and restart if required.

R Studio layout

R Studio is an integrated development environment (IDE) and is how we will interact with R. It looks like this:

The four panes are labelled in \(\color{green}{\text{green}}\) :

  • Top left: your script. This is where you will write your code.
  • Bottom left: your console. This is where your code will be sent to, and where some of your output will show. You can also write code directly into the console; but there will be no record of it in your script!
  • Top right: your environment. This lists all the things that R knows about so far. If you define an object, it will appear here.
  • Bottom right: a bit of a miscellaneous pane. Files list the files in your project. Packages lists the packages you have installed and loaded. Viewer is where our charts will be displayed.

I know this can all look a bit intimidating the first time you see it. That’s okay! We’ll get to know R Studio more as we go through this course.

Good folder structure and R Projects

Good folder structure is tedious and abstract and not-at-all-fun but it makes everything in the future easier. It simply means you have:

  1. One folder for each project. Here, the example is introduction_to_R. Your projects might be something like econometrics_assignment2 or honours_thesis. Whatever the project, everything you need for it is contained within the main project folder.
  2. A set of standardised subfolders. For example, keep the data you are using in a folder called data. Keep output (tables, charts, etc) in an output folder. Note that you can set these up how you like: but consistency makes it easier for you to switch between projects and collaborate with others.

Your script will often ask for things on the computer. For example, “read in this dataset” or “save this chart to a place”. For that, we have to tell the computer where it is. We can do that in two ways: 1) by explicitly setting a working directory (not the best way), or 2) by setting up an R Project (the best way).

Explicitly setting a working directory is a bad way to tell your computer where it is:

This is sometimes done by ‘setting a working directory’. This means having a line in your script that says ‘this is where we are’:
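For example (the path shown here is the illustrative one used later in this section):

```r
setwd("Users/yourname/Documents/myRfolder/this_project_of_mine")
```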

This is problematic and frustrating because your directory path won’t be the same as your collaborators’ or tutors’ (unless you have the same name and the same operating system!).

From Hadley Wickham’s R for Data Science:

But you should never [use setwd to set a working directory] because there’s a better way; a way that also puts you on the path to managing your R work like an expert.

^ That’s you! You’re already an expert.

Using R Projects is a good way to tell your computer where it is:

The best way to tell your computer where it is is to use R Projects. This is a little file that lives in your project folder with the suffix .Rproj. Opening this file opens R and sets your working directory to where it is.

This is beneficial because it means you don’t have to write setwd("Users/yourname/Documents/myRfolder/this_project_of_mine") on every single script you write. It also means that your collaborators can open your project folder on their computer and all scripts will run without a hitch.

You can set up an R Project by clicking File -> New Project and following the prompts.

Comments in your code

‘Comments’ are notes that live in your code and are preceded with #. R will ignore anything after the # symbol:
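For example (the numbers here are an assumption, chosen to match the output below):

```r
7 + 7  # R ignores everything after the hash, so this comment is fine
```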

## [1] 14

If we forgot to use the # to ‘comment’ things, we would generate a bunch of errors as R tries to work out what you’re on about:
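For example, a sentence of plain English with no # in front of it (the wording after ‘This is’ is a guess):

```r
This is not a comment
```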

## Error: <text>:1:6: unexpected symbol
## 1: This is
##          ^

Objects

Objects are ‘things’ that R knows about. They can be as simple as a single number, or more complicated, like an entire dataset.

You can tell R about an object using the assign <- operator:

We have said: take the number 12 and assign it <- to the object fave_number. R will note this down, put fave_number in our environment and we will be able to use it when we want later. Like:
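A sketch of that (doubling fave_number is an assumption that matches the output of 24 below):

```r
fave_number <- 12  # assign the number 12 to the object fave_number
fave_number * 2    # then use it in arithmetic
```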

## [1] 24

Or:

## [1] 49

We can have lots of objects stored in our environment, and we can call them whenever we want (after we have defined them):

## [1] 17

Functions

A function takes inputs (aka ‘arguments’) and produces outputs.

We can use the c function to combine (concatenate) numbers into a series of numbers (a vector):
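For example:

```r
c(3, 4, 5)  # combine three numbers into a vector
```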

## [1] 3 4 5

The output above, like all output in this document, is preceded by ## and then [1], meaning the first line of output. Here, the output is the vector of numbers we entered into the c function.

We can also nest functions, meaning we have one function inside another function. For example, we can combine numbers into a vector using the c function, then we can take the average (mean) of the vector:
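For example:

```r
mean(c(3, 4, 5))  # combine the numbers, then take their average
```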

## [1] 4

But nested functions are a bit difficult to read. You have to start from the inside and read outwards. Alternatively, we could assign our vector to an object using the assign <- operator:
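For example (using the object name goodnumbers mentioned below):

```r
goodnumbers <- c(3, 4, 5)  # assign the vector to an object
mean(goodnumbers)          # then take its mean
```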

## [1] 4

This will make changes in our environment to the top-right: it adds the object goodnumbers.

It will also produce output in the console to the bottom-left: the mean of goodnumbers.

It should look something like this:

Packages

R uses functions to do things. A package is a collection of functions. Some packages are installed when you install R, like base and stats (this is called “base R”).

A wonderful benefit of R is being able to use the community’s collection of functions. The tidyverse is a collection of functions that make data wrangling and visualisation much easier. And we can use this package for free.

Installing a package is like installing an app on your phone or computer: you need to do it, and you only need to do it once.

You can install a package using the install.packages function. Note that there will be lots of text that appears when installing a package. This text doesn’t make great reading, but it lays out what the package is doing as it installs. If there is an error installing, this is where you’ll find some (hopefully) useful information about why.

Now we need to load the package using the library function, like opening an app you have installed. We do this every time (every ‘session’) we want to use it.
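For the tidyverse, that pair of steps looks like:

```r
install.packages("tidyverse")  # once per computer (quotation marks needed)
library(tidyverse)             # once per session (no quotation marks needed)
```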

Some notes before we start

In my experience learning R, there are two things I would like to prepare you for:

You will make lots of errors and this will be frustrating

R will follow your instructions to the letter. If you misspell something or put an argument where it doesn’t belong, R will try to do the thing you asked and it will fail. Hopefully, it will clearly tell you what happened; other times it will be vague.

This means you will spend a lot of time debugging your code: you think what you’ve written makes sense, but you get an error. You try to work out what’s gone wrong; you play around; you search the internet; you give up on R; you go outside and enjoy your R-free life.

Make sure your brackets add up
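A missing closing bracket is a classic early mistake; if you run a line like this in the console, R will sit at a + prompt waiting for you to finish it:

```r
mean(c(3, 4, 5)   # one closing bracket short: R waits for more input
```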


We’re coders now

We have installed R and R Studio; installed and loaded packages; and we know about objects and functions.

We’re coders. And now we just have to learn how to do more:

This is a process you will continue as long as you’re using R.


Reading and exploring with visuals

The preamble is meant to get you up-and-running in R, which is necessary. But it is a little bit tedious and boring. In this part, we will go through some fun things like creating graphics. It will follow:

  1. Having a question.
  2. Reading data into R.
  3. Looking at our data plainly (like we would an Excel spreadsheet).
  4. Exploring data visually and interactively.

Read a CSV file into R

This uses the read_csv function and, here, we’re only going to give it one argument: the path to the CSV file you want to read, in quotation marks.

Tip: open quotation marks and hit tab to navigate to your file (and save you some typing).
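A sketch of the call (the file name and data folder are assumptions):

```r
read_csv("data/gapminder.csv")
```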

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

Looks good! But what do we have?

We have a tibble: a tidy version of a column/row dataset. This means every observation is a row, and every variable is a column.

The first line of output says that we have 1704 rows of observations \(\times\) 6 columns of variables. And from the data we read into R we can see that Afghanistan (country) had a life expectancy (lifeExp) of 28.8 in (year) 1952.

The read_csv function seems to have worked pretty well, and our output makes sense. But the output—our data—isn’t in our environment (on the top-right) yet because we didn’t assign it to anything. We assign something using <-, meaning we can call on it later.
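For example (again assuming the file lives in a data folder):

```r
gapminder <- read_csv("data/gapminder.csv")
```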

Now it is in our environment on the top-right of our R Studio window. This means we can call gapminder whenever we want it later.

Peeking at the data

Much like Excel, we can explore the raw gapminder dataset with our eyes.

View will open up a new tab that displays your dataset. You can scroll through it and look at each row and variable in your data.

If we just want a quick look, head will print just the first few observations. This is handy to check on things as you’re going along.
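For example:

```r
View(gapminder)  # opens the data in a new tab
head(gapminder)  # prints the first six rows
```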

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

names will display the names of all variables in the dataset (and is often the answer to ‘what was that variable called again…’)
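For example:

```r
names(gapminder)
```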

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

We should start by visualising our data

It is important to visualise your data to properly understand it. The why is explored more in Part 2. For now, let’s get into it:

To visualise data we first need to have questions. Our first question is one of importance:

Question 1: what is the relationship between life expectancy and income per person?

We have the data to answer this question. When we looked at the gapminder dataset, we saw that there were variables for life expectancy lifeExp, and for income per person gdpPercap, for each country in each year.

Our first ggplot

So we can take the gapminder dataset, generate an empty plot using ggplot and fill it with points: lifeExp on the x axis, and gdpPercap on the y axis:
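A sketch of that plot:

```r
ggplot(gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap))
```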

What just happened there? We used the ggplot2 package, which stands for the layered ‘grammar of graphics’ plot.5

Our plot has a few components:

  • Data: our gapminder object.
  • A plot, created using the ggplot function.
  • A geometric object: we have chosen geom_point to plot dots.
    • Aesthetic mapping: within geom_point we have defined an aesthetic with aes and mapped the x axis to represent lifeExp, and the y axis to represent gdpPercap.

We can see how that plays out by creating an empty plot:

Then fill it with our data:

And then we have to explain how to map the data to the plot using aesthetics aes, which we add to the plot using + (noting that the + comes at the end of a line).
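The three stages above, in code:

```r
ggplot()              # an empty plot

ggplot(gapminder)     # a plot that knows about our data, but has nothing mapped yet

ggplot(gapminder) +   # map the data with aesthetics, layered on with +
  geom_point(aes(x = lifeExp, y = gdpPercap))
```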

Start with an empty plot, then layer on a geometric object geom with some aesthetics aes(). We can add more layers to the same plot.

Exploring interactively with plotly

We can use the plotly package to interactively explore the scatter plot. First, install the plotly package (remembering quotation marks when we install a package)

Then load the package using library:

Then we define our normal-old-plot as an object using <-. This time we are going to map country to the label aesthetic so we can see which countries are which:

And place the plot in the ggplotly function:
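The steps above, sketched together (the object name gap_plot is an assumption; mapping country to label means ggplotly shows country names in the hover text):

```r
install.packages("plotly")   # install once, with quotation marks

library(plotly)              # load it for this session

gap_plot <- ggplot(gapminder, aes(x = lifeExp, y = gdpPercap, label = country)) +
  geom_point()

ggplotly(gap_plot)           # make the plot interactive
```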

Now we can explore the plot! Move your mouse over the plot and work out what that high income/medium life expectancy country is.

Changing scales

Our plots so far have been on ‘linear’ scales: they go up by the same amount for the whole scale.

But this might not always be the right scale for things that happen exponentially. For example, we might expect that income grows much faster than health does. So we decide to examine the relationship between (exponential) GDP per capita and life expectancy. We want to use a log10 scale on the y axis (where income is); so we add + the scale_y_log10 function to our plot:
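For example:

```r
ggplot(gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap)) +
  scale_y_log10()   # income on a log10 scale
```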

Adding a trendline

To add another geom to the same plot, we use the + symbol, then add our new set of instructions on a new line. A trendline is called using geom_smooth:
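A sketch, assuming we keep the log scale from above:

```r
ggplot(gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap)) +
  geom_smooth(aes(x = lifeExp, y = gdpPercap)) +   # add a trendline layer
  scale_y_log10()
```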

Note: there is a whole library of geoms to explore at https://ggplot2.tidyverse.org/reference/.

The trendline helps us with our question: we can see that, overall, higher income is correlated with higher life expectancy. This is a result we would expect.

But the way we are looking at this data might be hiding some important insights. We should explore them.

Adding colour

Our scatter plot so far shows two aesthetics (aes): lifeExp mapped to the x axis, and gdpPercap mapped to the y axis.

Let’s look at our variable names again and explore if we can squeeze some more information out of this plot:

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

It might be interesting to see how things vary by continent. So let’s map continent to the colour aesthetic. We do this in the same way we mapped life expectancy and GDP per capita to the x and y axes:
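For example:

```r
ggplot(gapminder) +
  geom_point(aes(x = lifeExp, y = gdpPercap, colour = continent)) +
  scale_y_log10()
```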

We have repeated ourselves a bit here. For each geom we have set the aesthetics: “this is the x axis”, “this is the y axis” and “this is the colour”. To save ourselves some typing (and ensure we’re being consistent), we can set the aesthetics in the first ggplot function. The geoms that follow will ‘inherit’ these aesthetics:
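For example:

```r
ggplot(gapminder, aes(x = lifeExp, y = gdpPercap, colour = continent)) +
  geom_point() +    # inherits x, y and colour from ggplot() above
  scale_y_log10()
```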

Adding size

Just like we mapped colour to a country’s continent, we can add size—the size of the points—to a variable (column) in our dataset. Let’s map size to population (pop).
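For example:

```r
ggplot(gapminder, aes(x = lifeExp, y = gdpPercap, colour = continent, size = pop)) +
  geom_point() +
  scale_y_log10()
```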

Setting global rules

We have mapped variables to geoms. This means they will take a value (x, y, colour, etc) depending on their variable’s value (lifeExp, gdpPercap, continent, etc).

But what if we just wanted to set a rule? Say, what if we wanted the colour of all points to be blue? Or the transparency of all points to be 50% regardless of their country, lifeExp or gdpPercap?

Remember that aes values are mapping values: they map a variable to a thing, and we keep those mappings in the aes() function.

If we want to set a rule outside of an aesthetic, we do just that:
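For example:

```r
ggplot(gapminder, aes(x = lifeExp, y = gdpPercap)) +
  geom_point(colour = "blue") +   # a rule, set outside aes(): every point is blue
  scale_y_log10()
```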

What would happen if we put colour = "blue" inside of aes? ggplot would do as it’s told: it would choose a colour for each value of the character string “blue”. As there is only one value, our chart looks like this:
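That is, sketching the mistake:

```r
ggplot(gapminder, aes(x = lifeExp, y = gdpPercap, colour = "blue")) +
  geom_point() +   # "blue" is treated as data, so ggplot picks its own colour
  scale_y_log10()
```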

But we got good information from colouring points by continent, so let’s not throw that away. We can instead say “set the transparency to 50%”. Transparency in the ggplot-world is alpha. So we’ll set alpha = 0.5 in the geom_point geometry.
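For example:

```r
ggplot(gapminder, aes(x = lifeExp, y = gdpPercap, colour = continent)) +
  geom_point(alpha = 0.5) +   # 50% transparency for every point
  scale_y_log10()
```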

Creating small multiples with facets

This is all getting a bit busy. It might be clearer to see each continent on its own separate chart. We can do this by adding a ‘facet’ to our plot: i.e. taking our chart, and plotting it ‘around’ ~ another variable.
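A sketch of that chart, saved as the chart object the next sections build on (carrying over the colour, transparency and log scale from above is an assumption):

```r
chart <- ggplot(gapminder, aes(x = lifeExp, y = gdpPercap, colour = continent)) +
  geom_point(alpha = 0.5) +
  scale_y_log10() +
  facet_wrap(~continent)   # draw the chart once for each continent

chart
```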

The code above says: take the chart we (now) know and love, and draw it for each continent in our dataset, showing each separately.

A trendline might be nice, and we can do that by adding it to our chart object:
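For example:

```r
chart + geom_smooth()   # the trendline inherits the chart's aesthetics
```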

And, like we did above, we can make this plot interactive with ggplotly:
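For example:

```r
ggplotly(chart)
```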

Making your plots dance with animation

We defined a chart—an object uncreatively called chart—and, because we’re only human, we would like to animate it into a gif to see it move over time.6

Animating a plot is easy thanks to the gganimate package. It follows the same ‘grammar of graphics’ structure as ggplot, and we just tack it on the end of our plot-making.

First we install the gganimate package, which has been (re-)built by Thomas Lin Pedersen.7

Once the gganimate package is installed, we load it using library and add a few bits to our chart. (Note that it will take a minute or two to build the animation.)

And we can save the animation using anim_save (this will save the last animation you created by default).
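A sketch of the whole sequence (transitioning over year, the title, and the output file name are all assumptions):

```r
install.packages("gganimate")   # once per computer

library(gganimate)              # once per session

chart +
  transition_time(year) +               # animate the chart through the years
  labs(title = "Year: {frame_time}")    # show the current year in the title

anim_save("output/gapminder.gif")       # saves the most recent animation
```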


Part 2: this is a mess

We have produced some nice graphics looking at the relationship between a country’s GDP per capita (per person) gdpPercap, and that country’s life expectancy lifeExp. Now we want to look deeper.

Looking at summary statistics (and why we need to visualise the data, too)

Most data is too ‘big’ to look at or make sense of row by row. We often use summary statistics to get an understanding of the data. You’ve done this before by taking the mean (average) of a series of numbers, or by looking at how two series are correlated.

Below, we’ll use the ‘Datasaurus Dozen’: thirteen datasets from the datasauRus package that share near-identical summary statistics:

## # A tibble: 13 x 6
##    dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
##    <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 away         54.3   47.8      16.8      26.9  -0.0641
##  2 bullseye     54.3   47.8      16.8      26.9  -0.0686
##  3 circle       54.3   47.8      16.8      26.9  -0.0683
##  4 dino         54.3   47.8      16.8      26.9  -0.0645
##  5 dots         54.3   47.8      16.8      26.9  -0.0603
##  6 h_lines      54.3   47.8      16.8      26.9  -0.0617
##  7 high_lines   54.3   47.8      16.8      26.9  -0.0685
##  8 slant_down   54.3   47.8      16.8      26.9  -0.0690
##  9 slant_up     54.3   47.8      16.8      26.9  -0.0686
## 10 star         54.3   47.8      16.8      26.9  -0.0630
## 11 v_lines      54.3   47.8      16.8      26.9  -0.0694
## 12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
## 13 x_shape      54.3   47.8      16.8      26.9  -0.0656

“never trust summary statistics alone; always visualize your data” — Alberto Cairo

[obs not finished]

Transforming our data

Recall our gapminder dataset:

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

It has rows and columns; observations and variables. Picture the gapminder dataset and:

  1. Create a new column called gdp, which is gdpPercap \(\times\) pop.
  2. Only keep rows from 2007
  3. Then remove the ‘year’ column.

So: what happened to our gapminder dataset?

  1. It got wider by adding a new variable:

  2. It got shorter by removing all the years that weren’t 2007:

  3. And it got thinner by removing a variable:

We’ll work through each of these steps slowly.

1. Adding a new variable

First, we created a new column called gdp.

This uses the function mutate, which works like this:
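In sketch form (mydata and newvar are hypothetical names):

```r
mutate(mydata, newvar = 10)
```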

This means: take mydata and add a column newvar, which is 10 for every observation. See how we use one equals sign = to define something. You can read this as: newvar IS 10. (We’ll look at what two equals signs == means in the next section).

Thinking about the gapminder dataset, we could say that we wanted—for some reason—to make everyone richer with a everyone_richer variable that was the current GDP per capita gdpPercap \(\times\) 1000:
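That is:

```r
mutate(gapminder, everyone_richer = gdpPercap * 1000)
```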

## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap everyone_richer
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>           <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.         779445.
##  2 Afghanistan Asia       1957    30.3  9240934      821.         820853.
##  3 Afghanistan Asia       1962    32.0 10267083      853.         853101.
##  4 Afghanistan Asia       1967    34.0 11537966      836.         836197.
##  5 Afghanistan Asia       1972    36.1 13079460      740.         739981.
##  6 Afghanistan Asia       1977    38.4 14880372      786.         786113.
##  7 Afghanistan Asia       1982    39.9 12881816      978.         978011.
##  8 Afghanistan Asia       1987    40.8 13867957      852.         852396.
##  9 Afghanistan Asia       1992    41.7 16317921      649.         649341.
## 10 Afghanistan Asia       1997    41.8 22227415      635.         635341.
## # … with 1,694 more rows

Great—everyone is richer! But note that this is not stored anywhere: the dataset with the everyone_richer variable was just printed on your screen. But we saw it do an important thing: for each observation, it took whatever the value of gdpPercap was and multiplied that value by 1000.

To get to the thing we were trying to do, add a gdp variable, we can use the mutate function and make sure we define it as an object:
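A sketch (the object name gap_gdp is an assumption):

```r
gap_gdp <- mutate(gapminder, gdp = gdpPercap * pop)
```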

As above, this will take the gapminder dataset and add a new variable gdp, which is equal to each observation’s GDP per capita multiplied by its population.

To make sure this has all worked, we can print the head of our dataset:
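For example (again assuming the object is called gap_gdp):

```r
head(gap_gdp)
```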

## # A tibble: 6 x 7
##   country     continent  year lifeExp      pop gdpPercap          gdp
##   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>        <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
## 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
## 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
## 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
## 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
## 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.

And, like we have done many times before, we can visualise it:
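A sketch (plotting gdp against lifeExp, and the gap_gdp dataset name, are assumptions):

```r
this_plot <- ggplot(gap_gdp, aes(x = lifeExp, y = gdp)) +
  geom_point()

this_plot
```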

Note that here we have defined an object called this_plot containing our plot, and then displayed it by writing its name.

We could make it easier to explore interactively by again using ggplotly:
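For example:

```r
ggplotly(this_plot)
```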

2. Filtering to keep only observations from 2007

Next, we want to filter our dataset to only keep observations from 2007. i.e. we want to know what the state of the world was before the global financial crisis. To do this, we use the (surprise!) filter function. Our first argument is the dataset we want to do something to, and we follow that with a condition:

filter(original_data, [CONDITION])

Conditional statements

A conditional statement is one for which some things are TRUE and some things are FALSE. A quick example of conditionals is below.

10 does NOT equal 20, so the ‘answer’ to this is FALSE:
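In code:

```r
10 == 20
```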

## [1] FALSE

See how we are using two equals signs == to declare something is equal to. You can read the above as: 10 IS EQUAL TO 20.

But 10 DOES equal 5 \(\times\) 2, so this is TRUE:
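In code:

```r
10 == 5 * 2
```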

## [1] TRUE

We use two equals signs == to ask whether something is true, whereas we use a single equals sign = to declare something as true, which is why we use it to define new variables. For example:
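In code:

```r
x = 10   # declare: x IS 10 (no output)
x == 2   # ask: IS x EQUAL TO 2?
```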

## [1] FALSE

The first line reads “declare (or assign) x as 10”, which stores x in our environment and doesn’t produce any output. The second line reads “x IS EQUAL TO 2”, which is not true (because we said that x was equal to 10). It produces the somewhat-aggressive output FALSE.

We can also use the does-not-equal sign != to say “this DOES NOT EQUAL that”. Below we are saying that 10 DOES NOT EQUAL 5 \(\times\) 2:
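In code:

```r
10 != 5 * 2
```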

## [1] FALSE

[potentially go into detail on conditionals because the examples can be funs]

Anyway: we want to filter our data to only those observations for which the year of the observations IS EQUAL TO 2007, or as we have learnt: year == 2007. The filter function does this:
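A sketch (assuming we filter the gap_gdp dataset from the previous step):

```r
gap_gdp07 <- filter(gap_gdp, year == 2007)
```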

And we can quickly check if that has worked by looking at the gap_gdp07 dataset, selecting $ only the year variable:
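Namely:

```r
gap_gdp07$year
```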

##   [1] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [15] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [29] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [43] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [57] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [71] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [85] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
##  [99] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [113] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [127] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [141] 2007 2007

This reads: Take the gap_gdp07 dataset and choose $ the year column.

It looks like they’re all 2007—which is exactly what we wanted. We could also wrap the code above in the unique function, which removes any duplicates and only shows unique values:
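Namely:

```r
unique(gap_gdp07$year)
```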

## [1] 2007

This reads: I want to look at each unique value (i.e. remove any duplicates) of the year column $ of the gap_gdp07 dataset.

Great! There’s only one unique year number in the gap_gdp07 dataset. Just what we wanted.
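unique works on any vector, not just dataset columns; a toy base-R example shows the idea:

```r
years <- c(2007, 2007, 2002, 2007)
unique(years)   # duplicates removed, first appearance kept
## [1] 2007 2002
```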

3. Removing the year variable

We can choose and drop variables in our dataset using the select function. This comes in handy when you’re working with large datasets and your poor computer only has so much memory. Since we have already filtered our dataset to only include observations from 2007, we can drop the year variable.

This reads: define gap_gdp07_noYear as the gap_gdp07 dataset and negatively select (drop) the year variable.
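The select chunk itself is missing; a sketch, with a toy data frame standing in for gap_gdp07 so it runs on its own:

```r
library(dplyr)

# Toy stand-in for the 2007-only dataset
gap_gdp07 <- data.frame(country = c("Albania", "Angola"),
                        year    = c(2007, 2007),
                        gdp     = c(2.2e10, 5.9e10))

# The minus sign in front of year means "drop this variable"
gap_gdp07_noYear <- select(gap_gdp07, -year)
names(gap_gdp07_noYear)
## [1] "country" "gdp"
```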

Wonderful! We have done the three things we wanted: we have one dataset that adds the gdp variable; the next that only keeps observations from 2007; and a final one that removes the year variable.

But creating all these intermediate datasets is clunky. There is a better way.

Piping it all together

This %>% is a pipe! The pipe is an odd concept and it is wonderful. Think of the things we’ve just done, where our goal was to create a new variable, keep observations from 2007 and drop the year variable:

We’ve created a whole bunch of objects that we don’t really care about. We can neatly put this together with pipes %>%.

A pipe works by taking the thing before it and making it the first argument of the function after it. So if we were simply adding 5 + 7 and then wanted to take the square root of that sum, we could define an object:

And take the square root of that object:

## [1] 3.464102

OR we could pipe %>% our number into the square root sqrt function:

## [1] 7.645751
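The output above, 7.645751, is not sqrt(12) = 3.464102 — it is 5 + sqrt(7). That reveals a precedence gotcha worth knowing: %>% binds more tightly than +, so 5 + 7 %>% sqrt() pipes only the 7 into sqrt. A sketch (assumes the magrittr pipe, which loading dplyr or the tidyverse also provides):

```r
library(magrittr)    # provides %>%

twelve <- 5 + 7
sqrt(twelve)         # [1] 3.464102

5 + 7 %>% sqrt()     # pipes only the 7: this is 5 + sqrt(7) = 7.645751
(5 + 7) %>% sqrt()   # parentheses first, then pipe: sqrt(12) = 3.464102
```

In short: wrap the expression in parentheses before piping it.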

The pipe %>% takes the thing before it and makes it the first argument in the next function. This is useful! Because we can do all the things we wanted to do in our three-step program in one:

Verbally, this says: assign gapminder07 to the original gapminder dataset, but add a column called gdp, then filter to only include observations from 2007, then drop the year variable.
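The piped chunk itself is missing; a sketch of what it likely looked like. The column names match the gapminder dataset, the toy data frame stands in for the real one, and gdp = gdpPercap * pop is an assumption about how the gdp column was defined earlier:

```r
library(dplyr)

# Toy stand-in for the gapminder data frame
gapminder <- data.frame(
  country   = c("Albania", "Albania", "Angola"),
  year      = c(2002, 2007, 2007),
  pop       = c(3.1e6, 3.6e6, 1.7e7),
  gdpPercap = c(4604, 5937, 4797)
)

gapminder07 <- gapminder %>%
  mutate(gdp = gdpPercap * pop) %>%   # 1. add a gdp column
  filter(year == 2007) %>%            # 2. keep only 2007
  select(-year)                       # 3. drop the year variable
```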

This ‘piping’ means we can pretty quickly filter and adjust graphs. Recall that the ggplot function needs a dataset as its first argument, from Part 1:

So: data is the first argument. Whatever we pipe %>% into it will be the data argument. Which means we can use our new filtering skills before we plot something, and pipe %>% it into our ggplot:
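The plotting chunk is missing; a sketch of the shape it takes, with toy data standing in for gapminder and the aesthetics chosen for illustration (the original chunk evidently used a smoother, hence the loess warnings below — geom_point avoids them):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the gapminder data frame
gapminder <- data.frame(
  country   = c("Albania", "Albania", "Angola"),
  year      = c(2002, 2007, 2007),
  lifeExp   = c(75.7, 76.4, 42.7),
  gdpPercap = c(4604, 5937, 4797)
)

p <- gapminder %>%
  filter(year == 2007) %>%                  # the filtered data becomes...
  ggplot(aes(x = gdpPercap, y = lifeExp)) + # ...ggplot's data argument
  geom_point()
p
```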

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.

## [a long run of similar loess warnings omitted — after filtering there are
## too few observations for geom_smooth’s default loess fit]

## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)

Joining datasets together

We have our gapminder dataset that contains information about countries in different years. Now, let’s say we have another dataset—maps—that contains geometry (i.e. map) information for those countries. We could use both datasets together to plot a map of lifeExp around the world, if only we could join the two datasets together.

We can do this by joining our new map dataset to our original gapminder dataset, using left_join.8

First, we read the maps dataset into our environment. Note that this time the data format is Rds, so we use the read_rds function.9
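As an aside — the filename in the real chunk isn’t shown here, so this self-contained sketch round-trips a built-in dataset through a temporary .Rds file instead (readr’s write_rds and read_rds; mtcars is just a stand-in object):

```r
library(readr)

tmp <- tempfile(fileext = ".Rds")
write_rds(mtcars, tmp)        # save any R object as an .Rds file...
roundtrip <- read_rds(tmp)    # ...and read it straight back in
identical(roundtrip, mtcars)  # the object survives unchanged
## [1] TRUE
```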

What, exactly, is this object? We can use the class function to ask:

## [1] "sf"         "data.frame"

It’s an sf (‘simple features’) object and a data.frame object! This is just like a tibble object, but with a special geometry variable. The geometry variable contains—surprise!—a bunch of spatial information about each observation. The geometry column contains a series of lines that make up the border of each country: one small straight line that connects to another small straight line that connects to another small straight line, and so on.

We would like to connect this spatial map data to our gapminder dataset. We can do this by taking the original (left) dataset, gapminder, and connecting the map dataset to it, using the function left_join:

We are saying:

  • In the first row of gapminder, look for a match in the dataset map, by country. If you find one, attach all the columns of map; if not, fill them with NA and move on.
    • Here, it will look for Afghanistan in the map object, find one observation, and attach all those Afghanistan map variables to our dataset.
  • In the second row of gapminder look for a match…(it will do the same thing for each row).
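That matching behaviour can be sketched with toy data (all names here are illustrative; real map data would hold geometries, not character strings):

```r
library(dplyr)

gapminder_mini <- data.frame(country = c("Afghanistan", "Albania", "Angola"),
                             lifeExp = c(43.8, 76.4, 42.7))
map_mini <- data.frame(country  = c("Albania", "Afghanistan"),
                       geometry = c("albania_polygon", "afghanistan_polygon"))

# For each row of gapminder_mini, look for a matching country in map_mini;
# rows with no match (Angola here) get NA in the map columns
joined <- left_join(gapminder_mini, map_mini, by = "country")
joined
```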

[A more detailed explanation about joining datasets and relational data can be found at https://r4ds.had.co.nz/relational-data.html]

Making maps

We have a dataset that contains information about various countries (gdpPercap, lifeExp, etc). We joined it to a dataset that contains spatial data for those countries in the geometry variable. This means we can use ggplot like we have been doing before, but this time with a different geom: geom_sf, which will plot your geometries (i.e. plot your map).

We want to plot GDP per capita (gdpPercap) in 2007. So we first filter our data:
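The plotting chunk is missing; as a self-contained stand-in, sf ships a North Carolina counties shapefile, and BIR74 is one of its columns — it plays the role gdpPercap would play with the joined gapminder map (the warnings below come from the original, unshown chunk):

```r
library(sf)
library(ggplot2)

# Built-in demo data: North Carolina counties, bundled with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# geom_sf draws the geometry column; fill maps a variable to colour,
# just as fill = gdpPercap would on the joined gapminder map
ggplot(nc) +
  geom_sf(aes(fill = BIR74))
```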

## Warning: package 'sf' was built under R version 3.5.2
## Warning: Removed 19 rows containing non-finite values (stat_sf).
## Warning: Computation failed in `stat_sf()`:
## no applicable method for 'st_bbox' applied to an object of class "list"

We can see some issues with our map. We have a bunch of missing data: mainly from eastern Europe and central Africa. We also


  1. It can be accessed for free here: https://r4ds.had.co.nz

  2. Stata has made some headway in this area, allowing .ado files written by users to be shared and used. But this is nowhere near the level of user-written functions in R.

  3. http://www.r4stats.com/articles/popularity/ (the potential bias is indicated in its domain)

  4. Adapted from https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/

  5. You can read Hadley Wickham’s A Layered Grammar of Graphics here: https://vita.had.co.nz/papers/layered-grammar.html. This section is a quick and incomplete summary.

  6. This is not a key feature of data science (yet), but it is fun.

  7. You can check out more of his work here: https://github.com/thomasp85

  8. You can explore the full gamut of joining types here: https://dplyr.tidyverse.org/reference/join.html. A very handy superhero-themed set of examples is here: https://stat545.com/bit001_dplyr-cheatsheet.html

  9. This is also from the readr package in the tidyverse. An rds file has the suffix .Rds and is made to be read into R. You can find out more about this kind of file here: https://readr.tidyverse.org/reference/read_rds.html